Welcome to R Math Camp

Bradley Katcher



Note: I am grateful to Rony Rodriguez-Ramirez and previous Math Camp and API-209 TFs for their materials, of which these are heavily based.

Slides available on

About me:

Bradley Katcher

  • 3rd year PhD student at HKS (Econ track)

    • From Cleveland, Ohio
    • Worked at the Federal Reserve in Washington, DC
    • Did my MPP and MS Data Science at University of Virginia
  • Big sports fan, enjoy cooking, running, hiking, travel, home brewing, and economics!

  • Will be the Teaching Fellow (TF) for API-209 this semester

My Research Assistant:

Meet Bee!

If we are on a Zoom call, you may hear her in the background. She’s very chatty!

Bee

Introduce Course Assistants:

CAs, please give:

  1. Your name
  2. Favorite tip for learning R
  3. Recommendation of at least one non-academic thing to do this semester in Boston/Cambridge.

Goals for the course:

  • Ensure that you know enough R to tackle assignments this semester.

    • Will not teach you everything about coding: this is a public policy program, not a CS course.

    • Goal is to bring everyone to a similar playing field!

      • May be a little slow for experienced R users
      • May be fast for those of you who have never programmed.
      • Feedback is encouraged!
  • Know how to craft questions and where to look for answers.

Why R?

  • Allows you to connect theory to empirics (data!)

  • Open source, large community, and used by lots of organizations

    • More reproducible and systematic than using Excel

      • I used it when working for the Fed
  • Produces very nice data visualizations

I don’t want to waste your time:

  • Many of you have substantial R experience

  • This course might be too fast/slow for some of you

  • Can’t force you to be here!

  • Reasons why you might still want to be, even if you are an experienced user:

    • Good refresher

    • Build out your GitHub Repo

      • Can offer a session on this if interested
    • Will be using exercises that are based on problem sets/exams from API-209

Where to turn for help:

Layout of course:

What are we doing to do during Math Camp (R portion)?

8 sessions over the next two weeks:

1. 4 Lessons (2 hours):

  • I will discuss coding strategies and implementations. Mostly walk-through of examples.

2. Labs (1.5 hours):

  • A hands-on session where I provide you with a challenge and we solve them together.

  • A chance to practice the skills we learn in lessons.

3. Office hours (optional!)

Classroom Norms:

  • Everyone deserves the chance to contribute

  • Start with curiosity, not judgement

  • Acknowledge varying levels of experience

  • Practice collaborative problem solving

  • Ask questions, often! No such thing as a “dumb question”

  • Attempt first, then seek input

  • Keep feedback constructive and kind

  • Stay present and engaged

  • Use tools and online resources appropriately

  • Respect everyone’s time and energy

AI for Troubleshooting:

To use AI effectively, consider the following:

  1. Share a clear question with your code + error message
  2. Ask for help identifying the error
  3. Review the AI’s suggestion critically and make sure you understand the fix
  4. Always test solutions in your own R environment

Cite when you use AI for troubleshooting, and any code written by AI should include a comment by you describing which AI you used and what the code does in your own words.

Don’t let AI do the thinking for you:

AI gives answers, but you need a deeper understanding.

  • You can’t debug or adapt code you don’t understand.

    • Knowing how matters more than getting a quick fix.

AI isn’t always correct:

  • Makes up functions

  • Suggests outdated/inefficient code

  • Doesn’t know dataset or context

Outline:

  • Lesson 1: Welcome, overview, dplyr

    • Lab 1: Working with a dataset on how to compute summary stats, basic data analysis
  • Lesson 2: Installing R, Tidyerse, ggplot, working with other packages

    • Lab 2: Tidy Data and Visualization in R using packages
  • Lesson 3: Writing your own functions, Quarto

    • Lab 3: Writing functions to calculate some statistics
  • Lesson 4: Data wrangling with multiple datasets, merging and integrating multiple data sources

    • Lab 4: Exercise based on first half of problem set 1

Goals for today:

  1. Learn how to import data and do basic data manipulation and cleaning.
  1. Work with a real dataset, understanding code book and documentation
  1. Learn key components of the tidyverse.
  1. Understanding data cleaning and data wrangling.

Getting Started

Steps to working with data in R:

1. Import (read.csv() is your friend)

2. Tidy (dplyr and stringr are your friends)

3. Transform the data (tidyverse is your friend)

4. Visualize the data (ggplot is your friend)

5. Communicate the data (quarto is your friend)

Structure of an R file:

We begin by writing a preamble and loading in packages:

# filename.do
# Author: Bradley Katcher
# Date: 
# Purpose: 
# Description:


# packages:
# to install packages, we use the following function:
#install.packages('dplyr') # you only need to do this once
library(dplyr) # this calls the package, you need to do this every time

# storing key variables:



# begin analysis

Reading in data:

Ideally most data that you get will be in csv (delimited) format.

  • You had experience reading in csv files in the summer work using the read_csv() command.

For today’s exercise, we are going to work with a dataset on the global diffusion of latent nuclear capabilities, produced by Matthew Fuhrmann:

Check out his website here with codebook.

Reading in data:

It looks like this dataset is in xlsx format. To read that in, we will 1. Install the openxlsx package 2. Call the openxlsx package 3. Read in the dataset

  • There are multiple ways to do this. For example, you could download the file and use the readxl package.
  • You can load multiple datasets into R.
#install.packages('openxlsx')
library(openxlsx)

full_nuclear_data <- read.xlsx('http://www.matthewfuhrmann.com/uploads/2/5/8/2/25820564/nl_dataset_v.1.2.xlsx')

nuclear_country_year <- read.xlsx('http://www.matthewfuhrmann.com/uploads/2/5/8/2/25820564/country-year_dataset.xlsx')

Preview the data

library(dplyr)

# Look at the full data:
head(full_nuclear_data)
  country_name ccode                        facility_name facility_ambiguity
1      Algeria   615     Hot cell facility at Ain Oussera                  0
2    Argentina   160    Ezeiza – SF Reprocessing Facility                  0
3    Argentina   160 Ezeiza II – SF Reprocessing Facility                -77
4    Argentina   160     Pilcaniyeu Enrichment Facility I                  0
5    Argentina   160    Pilcaniyeu Enrichment Facility II                -77
6    Australia   900                        Lucas Heights                  0
  enr_type size construction_start construction_end operation_start
1        1    1               1986             1992            1992
2        1    1               1968             1968            1968
3        1    2               1978             1990            9999
4        2    2               1979             1987            1987
5        2    2               2000             9999            9999
6        3    1               1965             1972            1972
  operation_end operation2_start operation2_end covert iaea regional military
1          7777               NA             NA      1    1        0        0
2          1973               NA             NA      0    0        0        0
3          9999               NA             NA      0    0        1        0
4          1994               NA             NA      1    0        1        1
5          9999               NA             NA      0    1        1        0
6          1983               NA             NA      0    1        0        0
  military_ambiguity multinational foreign_assistance
1                  1             0                  1
2                  1             0                  0
3                  0             0                  0
4                  0             0                  0
5                  0             0                  0
6                  1             0                  0
  foreign_assistance_ambiguity
1                            0
2                            1
3                            1
4                            1
5                            1
6                            0

Preview the data:

glimpse(full_nuclear_data)
Rows: 253
Columns: 20
$ country_name                 <chr> "Algeria", "Argentina", "Argentina", "Arg…
$ ccode                        <dbl> 615, 160, 160, 160, 160, 900, 900, 211, 1…
$ facility_name                <chr> "Hot cell facility at Ain Oussera", "Ezei…
$ facility_ambiguity           <dbl> 0, 0, -77, 0, -77, 0, 0, 0, 0, 0, 0, 0, 0…
$ enr_type                     <dbl> 1, 1, 1, 2, 2, 3, 7, 1, 7, 3, 3, 3, 6, 1,…
$ size                         <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 1, 3, 2, 1,…
$ construction_start           <dbl> 1986, 1968, 1978, 1979, 2000, 1965, 1982,…
$ construction_end             <dbl> 1992, 1968, 1990, 1987, 9999, 1972, 1992,…
$ operation_start              <dbl> 1992, 1968, 9999, 1987, 9999, 1972, 1992,…
$ operation_end                <dbl> 7777, 1973, 9999, 1994, 9999, 1983, 2007,…
$ operation2_start             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ operation2_end               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ covert                       <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,…
$ iaea                         <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
$ regional                     <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,…
$ military                     <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,…
$ military_ambiguity           <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
$ multinational                <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
$ foreign_assistance           <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,…
$ foreign_assistance_ambiguity <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,…
# can also use: full_nuclear_data |> glimpse() 

Preview the data:

# Look at the limited data:
head(nuclear_country_year)
  stateabb ccode year latency_lab latency_pilot
1      USA     2 1939           0             0
2      USA     2 1940           0             0
3      USA     2 1941           1             0
4      USA     2 1942           1             0
5      USA     2 1943           1             1
6      USA     2 1944           1             1
glimpse(nuclear_country_year)
Rows: 10,147
Columns: 5
$ stateabb      <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", …
$ ccode         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ year          <dbl> 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 19…
$ latency_lab   <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ latency_pilot <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

Data Manipulation and Wrangling:

Data manipulation:

  • Focus is changing data structure

  • Uses specific methods/functions

  • Works with structured data

  • Ex: sorting, filtering, aggregating

  • Part of data wrangling

Data wrangling:

  • Broader process with cleaning and transformations

  • Structured and unstructured data

  • Missing values, multiple datasets, format conversion

  • More time intensive

More advanced data wrangling:

  • Advanced filtering and selection: Conditional filtering and dynamic column selection

  • Complex mutate operations: Creating conditional columns using lags and leads

  • Data reshaping: pivoting data, advanced grouping

  • Efficient data handling: joining datasets, parallel processing

Key Packages:

These are tidyverse R packages that are used for data science, which share a common design philosophy, grammar, and data structures:

  • ggplot2: data visualization

  • dplyr: data manipulation

  • tidyr: data tidying

  • readr: data import

  • purrr: functional programming

  • tibble: modern data frames

  • stringr: string manipulation

  • forcats: categorical data

Most commonly used dplyr functions:

  • filter(): subset rows based on conditions
  • select(): choose columns by name
  • mutate(): create new columns or modify existing ones
    • commonly used with if_else() or case_when()
  • arrange(): reorder rows
  • group_by(): group data for summary operations
  • summarize(): aggregate data

Return to the data:

Cleaning the data. If names were not clean, then we would clean the names, which you could do using the following:

library(janitor)
head(nuclear_country_year)
  stateabb ccode year latency_lab latency_pilot
1      USA     2 1939           0             0
2      USA     2 1940           0             0
3      USA     2 1941           1             0
4      USA     2 1942           1             0
5      USA     2 1943           1             1
6      USA     2 1944           1             1
head(clean_names(nuclear_country_year))
  stateabb ccode year latency_lab latency_pilot
1      USA     2 1939           0             0
2      USA     2 1940           0             0
3      USA     2 1941           1             0
4      USA     2 1942           1             0
5      USA     2 1943           1             1
6      USA     2 1944           1             1

Cleaning a few things:

The codebook tells us that for many of the variables, there are values that are put in to signify missing values or non-operational facilities.

  • It is very important to read through the codebook and look at summary stats of the data:
summary(full_nuclear_data)
 country_name           ccode       facility_name      facility_ambiguity
 Length:253         Min.   :  2.0   Length:253         Min.   :-77.000   
 Class :character   1st Qu.:200.0   Class :character   1st Qu.:  0.000   
 Mode  :character   Median :360.0   Mode  :character   Median :  0.000   
                    Mean   :371.7                      Mean   : -3.474   
                    3rd Qu.:645.0                      3rd Qu.:  0.000   
                    Max.   :900.0                      Max.   :  1.000   
                                                                         
    enr_type          size      construction_start construction_end
 Min.   :1.000   Min.   :1.00   Min.   : -99       Min.   : -99    
 1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1943       1st Qu.:1954    
 Median :2.000   Median :2.00   Median :1960       Median :1973    
 Mean   :2.621   Mean   :2.02   Mean   :1513       Mean   :2081    
 3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.:1982       3rd Qu.:1988    
 Max.   :8.000   Max.   :3.00   Max.   :2012       Max.   :9999    
                                                                   
 operation_start operation_end  operation2_start operation2_end
 Min.   : -99    Min.   : -99   Min.   :1969     Min.   :1972  
 1st Qu.:1956    1st Qu.:1978   1st Qu.:1981     1st Qu.:1988  
 Median :1974    Median :1993   Median :1983     Median :2004  
 Mean   :2328    Mean   :3993   Mean   :1988     Mean   :3919  
 3rd Qu.:1988    3rd Qu.:7777   3rd Qu.:1996     3rd Qu.:7777  
 Max.   :9999    Max.   :9999   Max.   :2011     Max.   :7777  
                                NA's   :244      NA's   :244   
     covert              iaea            regional           military       
 Min.   :-99.0000   Min.   :-99.000   Min.   :-99.0000   Min.   :-99.0000  
 1st Qu.:  0.0000   1st Qu.:  0.000   1st Qu.:  0.0000   1st Qu.:  0.0000  
 Median :  1.0000   Median :  0.000   Median :  0.0000   Median :  1.0000  
 Mean   : -0.2609   Mean   : -1.617   Mean   : -0.6087   Mean   : -0.9565  
 3rd Qu.:  1.0000   3rd Qu.:  1.000   3rd Qu.:  0.0000   3rd Qu.:  1.0000  
 Max.   :  1.0000   Max.   :  1.000   Max.   :  1.0000   Max.   :  1.0000  
                                                                           
 military_ambiguity multinational     foreign_assistance
 Min.   :0.0000     Min.   :0.00000   Min.   :-99.0000  
 1st Qu.:0.0000     1st Qu.:0.00000   1st Qu.:  0.0000  
 Median :0.0000     Median :0.00000   Median :  0.0000  
 Mean   :0.1897     Mean   :0.09091   Mean   : -0.1146  
 3rd Qu.:0.0000     3rd Qu.:0.00000   3rd Qu.:  1.0000  
 Max.   :1.0000     Max.   :1.00000   Max.   :  1.0000  
                                                        
 foreign_assistance_ambiguity
 Min.   :0.0000              
 1st Qu.:0.0000              
 Median :0.0000              
 Mean   :0.1028              
 3rd Qu.:0.0000              
 Max.   :1.0000              
                             

Just looking at specific variables:

# just look at one variable:
summary(full_nuclear_data$military)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-99.0000   0.0000   1.0000  -0.9565   1.0000   1.0000 
# look at multiple variables:
summary(full_nuclear_data |> select(military, size))
    military             size     
 Min.   :-99.0000   Min.   :1.00  
 1st Qu.:  0.0000   1st Qu.:1.00  
 Median :  1.0000   Median :2.00  
 Mean   : -0.9565   Mean   :2.02  
 3rd Qu.:  1.0000   3rd Qu.:3.00  
 Max.   :  1.0000   Max.   :3.00  

Tabulating and Cross-Tabbing the data:

# tabulate technology employed at nuclear plant:
table(full_nuclear_data$enr_type)

  1   2   3   4   5   6   7   8 
112  26  67   7   7   7  23   4 
# cross-table of technology employed based on civilian/military status:
table(full_nuclear_data$enr_type, full_nuclear_data$military)
   
    -99  0  1
  1   2 44 66
  2   0  2 24
  3   2 27 38
  4   0  0  7
  5   0  2  5
  6   0  5  2
  7   0 15  8
  8   0  0  4

Exercise:

Question 1: What share of nuclear and plants in Europe are military as opposed to civilian? What share broken out by technology type?

What do we need to do?

  1. Filter to the countries of interest
  2. Select the variables that are important to us
  3. Handle missing/make sure the data is clean
  4. Group by technology type/military status
  5. Summarize

Look at the countries:

table(full_nuclear_data$country_name)

       Algeria      Argentina      Australia        Belgium         Brazil 
             1              4              2              1              7 
        Canada          China Czech Republic          Egypt         France 
             3             18              1              1             24 
       Germany          India           Iran           Iraq         Israel 
             8             10             10              9              4 
         Italy          Japan          Libya    Netherlands    North Korea 
             4              9              3              5              3 
        Norway       Pakistan        Romania         Russia   South Africa 
             2              8              1             32              5 
   South Korea          Spain         Sweden         Taiwan United Kingdom 
             4              1              2              3             20 
 United States     Yugoslavia 
            44              4 
# select just European countries:
europe_nuclear_data <- full_nuclear_data |> 
  filter(country_name %in% 
           c("Belgium", "Czech Republic", "France", "Germany",
             "Italy", "Netherlands", "Norway", "Romania",
             "Spain", "Sweden", "United Kingdom", "Yugoslavia"))

Limit to the variables we need:

# subset to the variables that we need:
europe_nuclear_data <- europe_nuclear_data |> 
  select(country_name, enr_type, military)

# get summary stats of the dataframe:
summary(europe_nuclear_data)
 country_name          enr_type        military      
 Length:73          Min.   :1.000   Min.   :-99.000  
 Class :character   1st Qu.:1.000   1st Qu.:  0.000  
 Mode  :character   Median :1.000   Median :  0.000  
                    Mean   :2.082   Mean   : -2.301  
                    3rd Qu.:3.000   3rd Qu.:  1.000  
                    Max.   :7.000   Max.   :  1.000  

Look at military:

We see that the share of military is negative if we look at summary stats. If we wanted to figure out the share that are military, right now, it wouldn’t make any sense.

# recode the missing observaitons:
military_summary_data <- europe_nuclear_data |> 
  # recode the missing military to be NA
  mutate(military = if_else(military == -99, NA, military))

# look at the result:
mean(military_summary_data$military)
[1] NA

Why didn’t this work? We need to remove NAs:

mean(military_summary_data$military, na.rm = T)
[1] 0.4225352

Putting it all together

How many military and civilian nuclear plants are there by power type?

# compute the number and share of military plants by power type
europe_power_type <- europe_nuclear_data |>
  # recode the missing to be NA:
  mutate(military = if_else(military == -99, NA, military)) |>
  # remove the missings:
  filter(!is.na(military)) %>%
  #group by power type:
  group_by(enr_type) %>%
  # summarise the share of military
  summarise(military = mean(military), # the share of plants that are military
            n_plants = n()) # the total number of plants by type

# look at the data:
europe_power_type
# A tibble: 7 × 3
  enr_type military n_plants
     <dbl>    <dbl>    <int>
1        1   0.455        44
2        2   0.667         3
3        3   0.0769       13
4        4   1             2
5        5   1             4
6        6   0             2
7        7   0.333         3

Make it look a little cleaner:

# add strings to teh pwoer type
europe_power_type  <- europe_power_type |>
  # create a variable with the name as a string instead of a number
  mutate(power_type = case_when(
         enr_type == 1 ~ "Reprocessing",
         enr_type == 2 ~ "Gaseous diffusion",
         enr_type == 3 ~ "Centrifuge",
         enr_type == 4 ~ "EMIS",
         enr_type == 5 ~ "Chemical & ion exchange",
         enr_type == 6 ~ "Aerodynamic isotope separation",
         enr_type == 7 ~ "Laser",
         enr_type == 8 ~ "Thermal diffusion")) |>
  # remove the number
  select(-enr_type) |>
  # sort by the number of plants (descending)
  arrange(-n_plants) |>
  # make sure that power_type is the first variable:
  select(power_type, everything())

# look at the results:
europe_power_type
# A tibble: 7 × 3
  power_type                     military n_plants
  <chr>                             <dbl>    <int>
1 Reprocessing                     0.455        44
2 Centrifuge                       0.0769       13
3 Chemical & ion exchange          1             4
4 Gaseous diffusion                0.667         3
5 Laser                            0.333         3
6 EMIS                             1             2
7 Aerodynamic isotope separation   0             2

Recap:

What have we done today?

  1. Started to get familiar with R
  2. Learned about good coding practices
  3. Previewed the tidyverse
  4. Work through a practice exercise

Comment early and often!

Install R on Your Machine

Please attempt this prior to the start of next lecture. If you successfully install R and RStudio without any issues, feel free to arrive at 9:15am on Thursday to allow us time to help other students

To install R, you should install both R and RStudio. To download and install R, you should click the following:

Download and Install R

  • Once you get to this website, click the “Download R for macOS” or “Download R for Windows” depending on your machine.

  • To get the R Studio interface, you can click on the same link as above and then click “Download RStudio Desktop.”

If you would like step by step instructions, you can do so here.

Please tell the teaching team if you need assistance!